Packages

Below are all packages used for the project.

library(here)
## here() starts at C:/Users/noahr/Desktop/PSTAT131
library(readxl)
library(writexl)
library(dplyr) #for pipe function
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(visdat) #for visualizing missing data
library(ggplot2)
library(corrr)
library(corrplot)
## corrplot 0.92 loaded
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.4     ✔ rsample      1.1.1
## ✔ dials        1.1.0     ✔ tibble       3.2.0
## ✔ infer        1.0.4     ✔ tidyr        1.3.0
## ✔ modeldata    1.1.0     ✔ tune         1.0.1
## ✔ parsnip      1.0.4     ✔ workflows    1.1.3
## ✔ purrr        1.0.1     ✔ workflowsets 1.0.0
## ✔ recipes      1.0.5     ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(ISLR)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ stringr   1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-6
library(themis) #for step_upsample
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
tidymodels_prefer()

Introduction

In this project, I will try to predict race results for the 2022 Men's WorldTour in professional road cycling using a wide variety of predictors. Road cycling is notoriously unpredictable, so it will be interesting to see how machine learning algorithms tackle the problem. The predictors fall into two general categories.

The first category is rider profile, which includes variables such as a rider's age, weight, and rankings across a variety of strengths. Pro cyclists generally specialize. Designated sprinters have very good 30-second power but are generally larger and lack power over longer efforts; they can sprint to a win in a flat race but get dropped when there is a hill. Other riders are very light with good sustained power; they are excellent climbers but cannot compete against the sprinters on flat finishes. Many riders specialize somewhere between a sprinter and a climber.

The second category is race profile, which includes attributes such as race length, vertical meters covered, race ranking, and more. Throughout the year, certain races rank higher than others. For example, most people have heard of the Tour de France, but only intense cycling fans know of races such as the Bemer Classic. Higher-ranked races attract stronger start lists and are more prestigious to win, which raises the level of competition: a rider who could do well at the Bemer Classic might struggle at the Tour de France.

By combining rider profile and race profile, I hope to create an algorithm that provides insights missed by most cycling commentators. Since races are so unpredictable, many influential variables are difficult to quantify and are thus missing from this analysis. For example, nothing here quantifies what happened at the race known as Strade Bianche in 2022, when strong winds blew half of the competitors off of the course, causing some spectacular crashes; a series of unfortunate events (all riders were relatively OK). Let's see how the models do.

Data Cleaning

Read in data

First, we read in the individual race result files and merge them into a single dataset.
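The race-by-race code below repeats the same read, label, select, and bind steps for every file. It is kept as it was run, but for reference, the pattern could be condensed with a small helper and `purrr::imap_dfr()`. This is an illustrative sketch only: it assumes `raceName` can be derived from each file name, which is only roughly true here (for example, `3_strade.xlsx` would yield `"strade"`, not `"Strade"`), so the explicit version below is what actually produced the dataset.

```r
# Illustrative sketch: compact version of the repeated read/label/bind pattern.
# Assumes file names (minus numeric prefixes) match the desired raceName values,
# which is not exactly true for this project's files.
race_files <- list.files(
  here("Project", "rawData", "AllRaceResults"),
  pattern = "\\.xlsx$", full.names = TRUE, recursive = TRUE
)
# Name each path by its file name, minus the extension and numeric prefix
names(race_files) <- sub("^[0-9]+_", "",
                         tools::file_path_sans_ext(basename(race_files)))

read_race <- function(path, name) {
  read_excel(path) %>%
    mutate(raceName = name) %>%
    select(Rnk, Rider, raceName)
}

# imap_dfr() passes each path together with its name, then row-binds the results
raceResults <- purrr::imap_dfr(race_files, read_race)
```

The intermediate per-race data frames (`Omloop`, `Strade`, and so on) are kept in the explicit version because they are referenced individually later.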

Omloop <- read_excel(here("Project", "rawData", "AllRaceResults", "2_Omloop.xlsx"))
Omloop$raceName <- "Omloop"
Omloop <- select(Omloop, Rnk, Rider, raceName)

Strade <- read_excel(here("Project", "rawData", "AllRaceResults", "3_strade.xlsx"))
Strade$raceName <- "Strade"

#use this to combine datasets
raceResults <- select(Strade, Rnk, Rider, raceName) %>% rbind(Omloop)

MSR <- read_excel(here("Project", "rawData", "AllRaceResults", "6_MSR.xlsx"))
MSR$raceName <- "MSR"
raceResults <- select(MSR, Rnk, Rider, raceName) %>% rbind(raceResults)

BruggeDePanne <- read_excel(here("Project", "rawData", "AllRaceResults", "8_BruggeDePanne.xlsx"))
BruggeDePanne$raceName <- "BruggeDePanne"
raceResults <- select(BruggeDePanne, Rnk, Rider, raceName) %>% rbind(raceResults)

E3 <- read_excel(here("Project", "rawData", "AllRaceResults", "9_E3.xlsx"))
E3$raceName <- "E3"
raceResults <- select(E3, Rnk, Rider, raceName) %>% rbind(raceResults)

GentWevelgem <- read_excel(here("Project", "rawData", "AllRaceResults", "10_GentWevelgem.xlsx"))
GentWevelgem$raceName <- "GentWevelgem"
raceResults <- select(GentWevelgem, Rnk, Rider, raceName) %>% rbind(raceResults)

DwarsDoor <- read_excel(here("Project", "rawData", "AllRaceResults", "11_DwarsDoor.xlsx"))
DwarsDoor$raceName <- "DwarsDoor"
raceResults <- select(DwarsDoor, Rnk, Rider, raceName) %>% rbind(raceResults)

RVV <- read_excel(here("Project", "rawData", "AllRaceResults", "12_RVV.xlsx"))
RVV$raceName <- "RVV"
raceResults <- select(RVV, Rnk, Rider, raceName) %>% rbind(raceResults)

AmstelGold <- read_excel(here("Project", "rawData", "AllRaceResults", "14_AmstelGold.xlsx"))
AmstelGold$raceName <- "AmstelGold"
raceResults <- select(AmstelGold, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisRoubaix <- read_excel(here("Project", "rawData", "AllRaceResults", "15_ParisRoubaix.xlsx"))
ParisRoubaix$raceName <- "ParisRoubaix"
raceResults <- select(ParisRoubaix, Rnk, Rider, raceName) %>% rbind(raceResults)

Fleche <- read_excel(here("Project", "rawData", "AllRaceResults", "16_Fleche.xlsx"))
Fleche$raceName <- "Fleche"
raceResults <- select(Fleche, Rnk, Rider, raceName) %>% rbind(raceResults)

LBL <- read_excel(here("Project", "rawData", "AllRaceResults", "17_LBL.xlsx"))
LBL$raceName <- "LBL"
raceResults <- select(LBL, Rnk, Rider, raceName) %>% rbind(raceResults)

EschbornFrankfurt <- read_excel(here("Project", "rawData", "AllRaceResults", "19_EschbornFrankfurt.xlsx"))
EschbornFrankfurt$raceName <- "EschbornFrankfurt"
raceResults <- select(EschbornFrankfurt, Rnk, Rider, raceName) %>% rbind(raceResults)

SanSebastian <- read_excel(here("Project", "rawData", "AllRaceResults", "24_SanSebastian.xlsx"))
SanSebastian$raceName <- "SanSebastian"
raceResults <- select(SanSebastian, Rnk, Rider, raceName) %>% rbind(raceResults)

Bemer <- read_excel(here("Project", "rawData", "AllRaceResults", "27_Bemer.xlsx"))
Bemer$raceName <- "Bemer"
raceResults <- select(Bemer, Rnk, Rider, raceName) %>% rbind(raceResults)

Bretagne <- read_excel(here("Project", "rawData", "AllRaceResults", "28_Bretagne.xlsx"))
Bretagne$raceName <- "Bretagne"
raceResults <- select(Bretagne, Rnk, Rider, raceName) %>% rbind(raceResults)

Quebec <- read_excel(here("Project", "rawData", "AllRaceResults", "29_Quebec.xlsx"))
Quebec$raceName <- "Quebec"
raceResults <- select(Quebec, Rnk, Rider, raceName) %>% rbind(raceResults)

Montreal <- read_excel(here("Project", "rawData", "AllRaceResults", "30_Montreal.xlsx"))
Montreal$raceName <- "Montreal"
raceResults <- select(Montreal, Rnk, Rider, raceName) %>% rbind(raceResults)

Lombardia <- read_excel(here("Project", "rawData", "AllRaceResults", "31_Lombardia.xlsx"))
Lombardia$raceName <- "Lombardia"
raceResults <- select(Lombardia, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage1.xlsx"))
UAEStage1$raceName <- "UAEStage1"
raceResults <- select(UAEStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage2.xlsx"))
UAEStage2$raceName <- "UAEStage2"
raceResults <- select(UAEStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage3.xlsx"))
UAEStage3$raceName <- "UAEStage3"
raceResults <- select(UAEStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage4.xlsx"))
UAEStage4$raceName <- "UAEStage4"
raceResults <- select(UAEStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage5.xlsx"))
UAEStage5$raceName <- "UAEStage5"
raceResults <- select(UAEStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage6.xlsx"))
UAEStage6$raceName <- "UAEStage6"
raceResults <- select(UAEStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

UAEStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "1_UAE", "UAEStage7.xlsx"))
UAEStage7$raceName <- "UAEStage7"
raceResults <- select(UAEStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage1.xlsx"))
ParisNiceStage1$raceName <- "ParisNiceStage1"
raceResults <- select(ParisNiceStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage2.xlsx"))
ParisNiceStage2$raceName <- "ParisNiceStage2"
raceResults <- select(ParisNiceStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage3.xlsx"))
ParisNiceStage3$raceName <- "ParisNiceStage3"
raceResults <- select(ParisNiceStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage4.xlsx"))
ParisNiceStage4$raceName <- "ParisNiceStage4"
raceResults <- select(ParisNiceStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage5.xlsx"))
ParisNiceStage5$raceName <- "ParisNiceStage5"
raceResults <- select(ParisNiceStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage6.xlsx"))
ParisNiceStage6$raceName <- "ParisNiceStage6"
raceResults <- select(ParisNiceStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage7.xlsx"))
ParisNiceStage7$raceName <- "ParisNiceStage7"
raceResults <- select(ParisNiceStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

ParisNiceStage8 <- read_excel(here("Project", "rawData", "AllRaceResults", "4_ParisNice", "ParisNiceStage8.xlsx"))
ParisNiceStage8$raceName <- "ParisNiceStage8"
raceResults <- select(ParisNiceStage8, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage1.xlsx"))
TirrenoStage1$raceName <- "TirrenoStage1"
raceResults <- select(TirrenoStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage2.xlsx"))
TirrenoStage2$raceName <- "TirrenoStage2"
raceResults <- select(TirrenoStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage3.xlsx"))
TirrenoStage3$raceName <- "TirrenoStage3"
raceResults <- select(TirrenoStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage4.xlsx"))
TirrenoStage4$raceName <- "TirrenoStage4"
raceResults <- select(TirrenoStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage5.xlsx"))
TirrenoStage5$raceName <- "TirrenoStage5"
raceResults <- select(TirrenoStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage6.xlsx"))
TirrenoStage6$raceName <- "TirrenoStage6"
raceResults <- select(TirrenoStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

TirrenoStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "5_Tirreno", "TirrenoStage7.xlsx"))
TirrenoStage7$raceName <- "TirrenoStage7"
raceResults <- select(TirrenoStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage1.xlsx"))
CatalunyaStage1$raceName <- "CatalunyaStage1"
raceResults <- select(CatalunyaStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage2.xlsx"))
CatalunyaStage2$raceName <- "CatalunyaStage2"
raceResults <- select(CatalunyaStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage3.xlsx"))
CatalunyaStage3$raceName <- "CatalunyaStage3"
raceResults <- select(CatalunyaStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage4.xlsx"))
CatalunyaStage4$raceName <- "CatalunyaStage4"
raceResults <- select(CatalunyaStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage5.xlsx"))
CatalunyaStage5$raceName <- "CatalunyaStage5"
raceResults <- select(CatalunyaStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage6.xlsx"))
CatalunyaStage6$raceName <- "CatalunyaStage6"
raceResults <- select(CatalunyaStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

CatalunyaStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "7_Catalunya", "CatalunyaStage7.xlsx"))
CatalunyaStage7$raceName <- "CatalunyaStage7"
raceResults <- select(CatalunyaStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage1.xlsx"))
ItzuliaStage1$raceName <- "ItzuliaStage1"
raceResults <- select(ItzuliaStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage2.xlsx"))
ItzuliaStage2$raceName <- "ItzuliaStage2"
raceResults <- select(ItzuliaStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage3.xlsx"))
ItzuliaStage3$raceName <- "ItzuliaStage3"
raceResults <- select(ItzuliaStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage4.xlsx"))
ItzuliaStage4$raceName <- "ItzuliaStage4"
raceResults <- select(ItzuliaStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage5.xlsx"))
ItzuliaStage5$raceName <- "ItzuliaStage5"
raceResults <- select(ItzuliaStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

ItzuliaStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "13_Itzulia", "ItzuliaStage6.xlsx"))
ItzuliaStage6$raceName <- "ItzuliaStage6"
raceResults <- select(ItzuliaStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandiePrologue <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandiePrologue.xlsx"))
RomandiePrologue$raceName <- "RomandiePrologue"
raceResults <- select(RomandiePrologue, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandieStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandieStage1.xlsx"))
RomandieStage1$raceName <- "RomandieStage1"
raceResults <- select(RomandieStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandieStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandieStage2.xlsx"))
RomandieStage2$raceName <- "RomandieStage2"
raceResults <- select(RomandieStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandieStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandieStage3.xlsx"))
RomandieStage3$raceName <- "RomandieStage3"
raceResults <- select(RomandieStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandieStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandieStage4.xlsx"))
RomandieStage4$raceName <- "RomandieStage4"
raceResults <- select(RomandieStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

RomandieStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "18_Romandie", "RomandieStage5.xlsx"))
RomandieStage5$raceName <- "RomandieStage5"
raceResults <- select(RomandieStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage1.xlsx"))
GiroStage1$raceName <- "GiroStage1"
raceResults <- select(GiroStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage2.xlsx"))
GiroStage2$raceName <- "GiroStage2"
raceResults <- select(GiroStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage3.xlsx"))
GiroStage3$raceName <- "GiroStage3"
raceResults <- select(GiroStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage4.xlsx"))
GiroStage4$raceName <- "GiroStage4"
raceResults <- select(GiroStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage5.xlsx"))
GiroStage5$raceName <- "GiroStage5"
raceResults <- select(GiroStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage6.xlsx"))
GiroStage6$raceName <- "GiroStage6"
raceResults <- select(GiroStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage7.xlsx"))
GiroStage7$raceName <- "GiroStage7"
raceResults <- select(GiroStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage8 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage8.xlsx"))
GiroStage8$raceName <- "GiroStage8"
raceResults <- select(GiroStage8, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage9 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage9.xlsx"))
GiroStage9$raceName <- "GiroStage9"
raceResults <- select(GiroStage9, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage10 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage10.xlsx"))
GiroStage10$raceName <- "GiroStage10"
raceResults <- select(GiroStage10, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage11 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage11.xlsx"))
GiroStage11$raceName <- "GiroStage11"
raceResults <- select(GiroStage11, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage12 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage12.xlsx"))
GiroStage12$raceName <- "GiroStage12"
raceResults <- select(GiroStage12, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage13 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage13.xlsx"))
GiroStage13$raceName <- "GiroStage13"
raceResults <- select(GiroStage13, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage14 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage14.xlsx"))
GiroStage14$raceName <- "GiroStage14"
raceResults <- select(GiroStage14, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage15 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage15.xlsx"))
GiroStage15$raceName <- "GiroStage15"
raceResults <- select(GiroStage15, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage16 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage16.xlsx"))
GiroStage16$raceName <- "GiroStage16"
raceResults <- select(GiroStage16, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage17 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage17.xlsx"))
GiroStage17$raceName <- "GiroStage17"
raceResults <- select(GiroStage17, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage18 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage18.xlsx"))
GiroStage18$raceName <- "GiroStage18"
raceResults <- select(GiroStage18, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage19 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage19.xlsx"))
GiroStage19$raceName <- "GiroStage19"
raceResults <- select(GiroStage19, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage20 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage20.xlsx"))
GiroStage20$raceName <- "GiroStage20"
raceResults <- select(GiroStage20, Rnk, Rider, raceName) %>% rbind(raceResults)

GiroStage21 <- read_excel(here("Project", "rawData", "AllRaceResults", "20_Giro", "GiroStage21.xlsx"))
GiroStage21$raceName <- "GiroStage21"
raceResults <- select(GiroStage21, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage1.xlsx"))
DauphineStage1$raceName <- "DauphineStage1"
raceResults <- select(DauphineStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage2.xlsx"))
DauphineStage2$raceName <- "DauphineStage2"
raceResults <- select(DauphineStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage3.xlsx"))
DauphineStage3$raceName <- "DauphineStage3"
raceResults <- select(DauphineStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage4.xlsx"))
DauphineStage4$raceName <- "DauphineStage4"
raceResults <- select(DauphineStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage5.xlsx"))
DauphineStage5$raceName <- "DauphineStage5"
raceResults <- select(DauphineStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage6.xlsx"))
DauphineStage6$raceName <- "DauphineStage6"
raceResults <- select(DauphineStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage7.xlsx"))
DauphineStage7$raceName <- "DauphineStage7"
raceResults <- select(DauphineStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

DauphineStage8 <- read_excel(here("Project", "rawData", "AllRaceResults", "21_Dauphine", "DauphineStage8.xlsx"))
DauphineStage8$raceName <- "DauphineStage8"
raceResults <- select(DauphineStage8, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage1.xlsx"))
SuisseStage1$raceName <- "SuisseStage1"
raceResults <- select(SuisseStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage2.xlsx"))
SuisseStage2$raceName <- "SuisseStage2"
raceResults <- select(SuisseStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage3.xlsx"))
SuisseStage3$raceName <- "SuisseStage3"
raceResults <- select(SuisseStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage4.xlsx"))
SuisseStage4$raceName <- "SuisseStage4"
raceResults <- select(SuisseStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage5.xlsx"))
SuisseStage5$raceName <- "SuisseStage5"
raceResults <- select(SuisseStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage6.xlsx"))
SuisseStage6$raceName <- "SuisseStage6"
raceResults <- select(SuisseStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage7.xlsx"))
SuisseStage7$raceName <- "SuisseStage7"
raceResults <- select(SuisseStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

SuisseStage8 <- read_excel(here("Project", "rawData", "AllRaceResults", "22_Suisse", "SuisseStage8.xlsx"))
SuisseStage8$raceName <- "SuisseStage8"
raceResults <- select(SuisseStage8, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage1 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage1.xlsx"))
TDFStage1$raceName <- "TDFStage1"
raceResults <- select(TDFStage1, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage2 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage2.xlsx"))
TDFStage2$raceName <- "TDFStage2"
raceResults <- select(TDFStage2, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage3 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage3.xlsx"))
TDFStage3$raceName <- "TDFStage3"
raceResults <- select(TDFStage3, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage4 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage4.xlsx"))
TDFStage4$raceName <- "TDFStage4"
raceResults <- select(TDFStage4, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage5 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage5.xlsx"))
TDFStage5$raceName <- "TDFStage5"
raceResults <- select(TDFStage5, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage6 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage6.xlsx"))
TDFStage6$raceName <- "TDFStage6"
raceResults <- select(TDFStage6, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage7 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage7.xlsx"))
TDFStage7$raceName <- "TDFStage7"
raceResults <- select(TDFStage7, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage8 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage8.xlsx"))
TDFStage8$raceName <- "TDFStage8"
raceResults <- select(TDFStage8, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage9 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage9.xlsx"))
TDFStage9$raceName <- "TDFStage9"
raceResults <- select(TDFStage9, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage10 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage10.xlsx"))
TDFStage10$raceName <- "TDFStage10"
raceResults <- select(TDFStage10, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage11 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage11.xlsx"))
TDFStage11$raceName <- "TDFStage11"
raceResults <- select(TDFStage11, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage12 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage12.xlsx"))
TDFStage12$raceName <- "TDFStage12"
raceResults <- select(TDFStage12, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage13 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage13.xlsx"))
TDFStage13$raceName <- "TDFStage13"
raceResults <- select(TDFStage13, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage14 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage14.xlsx"))
TDFStage14$raceName <- "TDFStage14"
raceResults <- select(TDFStage14, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage15 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage15.xlsx"))
TDFStage15$raceName <- "TDFStage15"
raceResults <- select(TDFStage15, Rnk, Rider, raceName) %>% rbind(raceResults)

TDFStage16 <- read_excel(here("Project", "rawData", "AllRaceResults", "23_TDF", "TDFStage16.xlsx"))
TDFStage16$raceName <- "TDFStage16"
#the remaining stage files all follow the same read/label/bind pattern, so read
#them in with a loop rather than one copy-pasted block per file
stageNames <- c(paste0("TDFStage", 16:21),
                paste0("PolandStage", 1:7),
                paste0("VueltaStage", 2:21))
stageDirs <- c(rep("23_TDF", 6),
               rep("25_Poland", 7),
               rep("26_Vuelta", 20))

for (i in seq_along(stageNames)) {
  stage <- read_excel(here("Project", "rawData", "AllRaceResults",
                           stageDirs[i], paste0(stageNames[i], ".xlsx")))
  stage$raceName <- stageNames[i]
  raceResults <- select(stage, Rnk, Rider, raceName) %>% rbind(raceResults)
}

Next, we read in the race profiles and rider profiles. We then merge all three of these dataframes together to make our final dataframe.

#read in the rider profile excel sheet
riderProfile <- read_excel(here("Project", "rawData", "RiderProfile.xlsx"))

#merge the rider profile df and the race results df
riderPlusRaceResults <- merge(riderProfile, raceResults, by.x="RiderName", by.y="Rider", all=TRUE)

#read in the race profile excel sheet
raceProfile <- read_excel(here("Project", "rawData", "raceProfile.xlsx"))

#merge the previously merged df and the raceProfile df
bikingDF <- merge(riderPlusRaceResults, raceProfile, by.x="raceName", by.y="RaceName")

Dealing with NAs

Now, we need to figure out how to deal with our NAs. Let’s visualize the missing data first.

#visualize missing data
vis_miss(bikingDF)

Overall, there is very little missing data. The variables with the most missingness are weight and height, and even for those only about 2% and 1% of values are missing, respectively. I know from data collection that riders missing weight and height values generally were not major factors in races. Plenty of riders fall into that category, so losing them from the dataset is not a big deal. The same goes for missing values in Team, PCSRanking, and PCSTeamRanking. Thus, I will now remove all observations with missing values. I am also going to remove all DNF, DNS, and OTL observations. The reasons behind receiving a "did not finish", "did not start", or "outside of time limit" are beyond the scope of this investigation. Removing them also means the Rnk column can be made numeric.

#convert Rnk column from character entries to numeric entries
bikingDF$Rnk <- as.numeric(bikingDF$Rnk)
## Warning: NAs introduced by coercion
#all DNF, DNS, OTL are converted to NAs which are then removed by the na.omit function

#remove rows with NAs
bikingDF <- na.omit(bikingDF)

Exporting our data frames

Now, I will export bikingDF so it can be used in other files.

#save bikingDF
write_xlsx(bikingDF, here("Project", "rawData", "bikingDF.xlsx"))

Exploratory Data Analysis

Packages and reading in data

Now I will read in the overall data, along with the raceProfile and riderProfile data. I am reading in raceProfile and riderProfile separately because they are easier to conduct exploratory data analysis on.

#read in bikingDF
biking <- read_excel(here("Project", "rawData", "bikingDF.xlsx"))

#read in raceProfile
raceProfile <- read_excel(here("Project", "rawData", "raceProfile.xlsx"))

#read in riderProfile
riderProfile <- read_excel(here("Project", "rawData", "RiderProfile.xlsx"))

Exploring the biking dataframe

First I am going to explore the main biking dataframe.

#general exploration
str(biking)
## tibble [20,305 × 37] (S3: tbl_df/tbl/data.frame)
##  $ raceName              : chr [1:20305] "AmstelGold" "AmstelGold" "AmstelGold" "AmstelGold" ...
##  $ RiderName             : chr [1:20305] "VAN AVERMAET GregAG2R Citroën Team" "BETTIOL AlbertoEF Education-EasyPost" "APERS RubenSport Vlaanderen - Baloise" "HAIG JackBahrain - Victorious" ...
##  $ RiderLastName         : chr [1:20305] "Van Avermaet" "Bettiol" "Apers" "Haig" ...
##  $ Team                  : chr [1:20305] "AG2R Citroën Team" "EF Education-EasyPost" "Sport Vlaanderen - Baloise" "Bahrain - Victorious" ...
##  $ AgeYear               : num [1:20305] 37 29 24 29 23 25 35 29 25 26 ...
##  $ WeightKG              : num [1:20305] 74 69 70 67 75 67 77 70 66 73 ...
##  $ HeightM               : num [1:20305] 1.81 1.8 1.79 1.9 1.84 1.8 1.85 1.8 1.8 1.93 ...
##  $ OneDayRaceScore       : num [1:20305] 11470 1825 88 605 239 ...
##  $ GCScore               : num [1:20305] 3717 621 1 2287 171 ...
##  $ TimeTrialScore        : num [1:20305] 575 662 16 268 230 1 3 4 3 214 ...
##  $ SprintScore           : num [1:20305] 4951 221 50 178 137 ...
##  $ ClimberScore          : num [1:20305] 7664 1748 11 2250 706 ...
##  $ PCSRanking            : num [1:20305] 80 70 664 163 91 ...
##  $ Wins                  : num [1:20305] 41 4 0 2 0 0 0 1 0 3 ...
##  $ GrandTours            : num [1:20305] 12 6 0 9 4 0 2 8 3 2 ...
##  $ Classics              : num [1:20305] 54 24 3 9 2 4 9 15 3 6 ...
##  $ PCSTeamRanking        : num [1:20305] 12 16 24 5 5 24 6 9 14 9 ...
##  $ Rnk                   : num [1:20305] 24 95 121 31 84 66 103 33 96 43 ...
##  $ AvgSpeedWinner        : num [1:20305] 42.2 42.2 42.2 42.2 42.2 ...
##  $ Distance              : num [1:20305] 254 254 254 254 254 ...
##  $ StageNum              : num [1:20305] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ProfileScore          : num [1:20305] 112 112 112 112 112 112 112 112 112 112 ...
##  $ VertMeters            : num [1:20305] 3460 3460 3460 3460 3460 3460 3460 3460 3460 3460 ...
##  $ RaceRanking           : num [1:20305] 22 22 22 22 22 22 22 22 22 22 ...
##  $ StartlistQualScore    : num [1:20305] 702 702 702 702 702 702 702 702 702 702 ...
##  $ ParcourTypeCategorical: chr [1:20305] "HillFlat" "HillFlat" "HillFlat" "HillFlat" ...
##  $ MaxAlt                : num [1:20305] 271 271 271 271 271 271 271 271 271 271 ...
##  $ WinningTimeHours      : num [1:20305] 6 6 6 6 6 6 6 6 6 6 ...
##  $ WinningTimeMinutes    : num [1:20305] 1 1 1 1 1 1 1 1 1 1 ...
##  $ WinningTimeMin        : num [1:20305] 361 361 361 361 361 361 361 361 361 361 ...
##  $ WonByCategorical      : chr [1:20305] "SmallSprint" "SmallSprint" "SmallSprint" "SmallSprint" ...
##  $ NumStarted            : num [1:20305] 170 170 170 170 170 170 170 170 170 170 ...
##  $ NumFinished           : num [1:20305] 126 126 126 126 126 126 126 126 126 126 ...
##  $ Months                : num [1:20305] 4 4 4 4 4 4 4 4 4 4 ...
##  $ Days                  : num [1:20305] 10 10 10 10 10 10 10 10 10 10 ...
##  $ PercentFinished       : num [1:20305] 0.741 0.741 0.741 0.741 0.741 ...
##  $ DaysIntoYear          : num [1:20305] 132 132 132 132 132 ...
#histogram of finishing position
ggplot(data=biking, aes(x=Rnk)) +
  geom_histogram(binwidth=1) +
  labs(title="Histogram of Finishing Position From Races", x="Finishing Position", y="Count")

#Finishing place versus PCS ranking
ggplot(biking, aes(x=PCSRanking, y=Rnk)) +
  geom_point(size=1) +
  geom_smooth(se=FALSE) +
  labs(title="Scatter Plot of PCS Ranking versus Finishing Place in Races", x="PCS Ranking", y="Finishing Place")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

There is something curious to note about the finishing position. I would expect the bar chart to follow a smooth trend (i.e., the number of finishers in position n never exceeds the number in position n-1), but this does not appear to be the case. This may have resulted from removing NAs, or it may be a data entry error; I will keep an eye on it in the coming analysis. I also produced a scatter plot of PCS ranking versus finishing place. This graph shows what we would expect: riders with a better (numerically lower) PCS ranking tend to achieve better finishing places. It is worth noting, however, that the slope of the spline varies across the graph. It is very steep from a PCS ranking of 1 through about 150, so the highest-ranked riders finish much better than even riders ranked slightly lower. After that, the spline flattens out, showing that PCS ranking matters less in this range.
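The non-monotonic pattern noted above can be checked directly. A sketch (assuming the biking dataframe from the chunks above) counts finishers at each position and flags any position whose count exceeds the previous position's count:

```r
#count finishers at each position, then flag positions where the count
#jumps back up relative to the position before it
posCounts <- biking %>%
  count(Rnk, name = "n") %>%
  arrange(Rnk) %>%
  mutate(jumpUp = n > lag(n))

#positions violating the expected strictly decreasing pattern
filter(posCounts, jumpUp)
```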

Now, I want to produce some graphs of riders who finished in the top 10.

#first, I am going to create a subset of data where we only include riders who finished in the top 10 of a race
bikingTop10 <- biking %>% 
  filter(Rnk<=10)

#now let's make a histogram of these riders' ages
ggplot(bikingTop10, aes(x=AgeYear)) +
  geom_histogram(binwidth=1) +
  labs(title="Histogram of the Ages of Riders who Finished in the Top 10", x="Age", y="Count")

#a histogram of teams of these riders
ggplot(bikingTop10, aes(x=factor(PCSTeamRanking))) +
  geom_bar(stat="count") +
  labs(title="Bar Plot of Team Ranking that Riders are on who Finished in the Top 10", x="Team Ranking (according to PCS)", y="Count")

From the first graph, we can tell that age is approximately normally distributed around 28. It is interesting that there are relatively few results at age 34; this is probably because there happen to be few riders performing well at that age. The team ranking graph shows what we would expect: better-ranked teams have more riders in the top 10 of races. It is interesting that this graph is not strictly decreasing. For example, the second-ranked team has more top-10 finishes than the first-ranked team. This shows that team rankings are not determined solely by top-10 finishes; who wins races matters as well.

Plots of race profile

#Race speed versus Startlist quality
ggplot(raceProfile, aes(x=StartlistQualScore, y=AvgSpeedWinner)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  labs(title="Predicting Race Speed based on Startlist Quality", x="Startlist Quality", y="Speed of Race (km/hr)")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(raceProfile, aes(x=factor(ParcourTypeCategorical), y=AvgSpeedWinner)) +
  geom_boxplot(outlier.shape=NA) +
  geom_jitter(alpha=0.1) +
  labs(title="Race Speed Across Different Types of Race", x="Type of Race", y="Race Speed (km/hr)")

For the scatterplot, there is a weak positive association. I would hesitate to conclude that startlist quality increases race speed; there may be some association between speed and startlist quality, but it is a weak one.

For the boxplot, race speed varies a lot by the type of race. Note that for most parcour type names, the first word indicates the general profile of the race while the second word indicates how the race finished (e.g., HillFlat is a hilly race that ended flat, while FlatHill is a flat race that ended with a hill).

Correlation

Let’s test out correlation in our dataset.

#test correlation
cor_results <- biking %>% 
  select(-Rnk) %>% 
  correlate()
## Non-numeric variables removed from input: `raceName`, `RiderName`, `RiderLastName`, `Team`, `ParcourTypeCategorical`, and `WonByCategorical`
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
cor_results %>% 
  stretch() %>% 
  ggplot(aes(x,y,fill=r)) +
  geom_tile() +
  geom_text(aes(label=as.character(fashion(r)))) +
  theme(axis.text.x=element_text(angle=-90))

Overall, very few variables have extremely high correlation. A few are highly correlated as expected. For example, WinningTimeHours and WinningTimeMin are highly correlated because WinningTimeHours was used to calculate WinningTimeMin. Also, on PCS, race ranking is a function of the startlist quality score, so it makes sense that RaceRanking and StartlistQualScore are very highly correlated. The same is true for VertMeters and ProfileScore. Some other interesting high correlations exist, such as between the number of grand tours raced and the age of the rider; a high positive correlation makes sense here, but a value this high is surprising. Also, WinningTimeMin and Distance having a correlation of 0.97 is notable. It makes sense that these two are highly correlated, but a correlation of almost one is striking; it seems like other factors, such as vertical meters or how the riders raced, would lower the correlation slightly. I am very happy, however, to have so many variables with relatively low correlation to play with in my analysis.
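The strongly correlated pairs called out above can also be listed programmatically. A sketch, assuming the cor_results object from the chunk above (shave() and stretch() are from the corrr package):

```r
#keep one copy of each pair (lower triangle), pivot to long form,
#and list pairs with correlation beyond 0.9 in magnitude
cor_results %>%
  shave() %>%
  stretch(na.rm = TRUE) %>%
  filter(abs(r) > 0.9) %>%
  arrange(desc(abs(r)))
```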

Split data into regression dataset and classification dataset

I am unsure if my data will work better as a regression dataset or as a classification dataset. Thus, I have decided that I am going to try both. In bike races, riders compete heavily for the top 10 but after that riders are not overly worried about whether they finish in 42nd or in 72nd. Thus, for my classification dataset I am going to split the results into a top10 classification and then a notTop10 classification. I will then run models on this classification dataset. The code below is how I created the new classification dataset.

#copy the dataframe
bikingClass <- biking

#assign categories
bikingClass$Rnk <- ifelse(bikingClass$Rnk %in% c(1:10), "top10", "notTop10")

#plot of finishes in top 10 versus not in the top 10
ggplot(bikingClass, aes(x=factor(Rnk))) +
  geom_bar(stat="count") +
  labs(title="Count of Finishes in the Top 10 Versus not in the Top 10", x=NULL, y="Count")

#export classification
write_xlsx(bikingClass, here("Project", "rawData", "bikingClass.xlsx"))

I will need to conduct some upsampling so that my class sizes are closer to equal. I will upsample to an over_ratio of 0.5.
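To illustrate what over_ratio = 0.5 means, here is a toy sketch with made-up counts (it assumes the themis package, which supplies step_upsample): the minority class is resampled with replacement until its count reaches half the majority class count.

```r
library(recipes)
library(themis)

#hypothetical, heavily imbalanced toy data: 20 top10 rows, 180 notTop10 rows
toy <- data.frame(
  Rnk = factor(c(rep("top10", 20), rep("notTop10", 180))),
  x = rnorm(200)
)

upsampled <- recipe(Rnk ~ x, data = toy) %>%
  step_upsample(Rnk, over_ratio = 0.5) %>%
  prep() %>%
  bake(new_data = NULL)

#notTop10 stays at 180; top10 is resampled up to 0.5 * 180 = 90
table(upsampled$Rnk)
```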

Run the Classification Models

Now we will run the classification models.

Read in dataset

bikeClass <- read_excel(here("Project", "rawData", "bikingClass.xlsx"))
bikingOG <- bikeClass #for the corr plot

Split data

First, we need to split the data into our training and testing set

#make the response variable (whether or not the rider is in the top 10) a factor
bikeClass$Rnk <- factor(bikeClass$Rnk)

#make the classification variables into factors
bikeClass$ParcourTypeCategorical <- factor(bikeClass$ParcourTypeCategorical)
bikeClass$WonByCategorical <- factor(bikeClass$WonByCategorical)

#remove unnecessary variables
bikeClass <- bikeClass %>% 
  select(-c(raceName, RiderName, RiderLastName, Team, WinningTimeHours, WinningTimeMinutes, Months, Days, VertMeters, StartlistQualScore))

I removed raceName, RiderName, RiderLastName, and Team because these are identification variables and will not be useful in our analysis. I removed WinningTimeHours and WinningTimeMinutes because both were used to determine WinningTimeMin: (\(Hours*60+Minutes\)). I removed Months and Days because they were used to determine DaysIntoYear: (\(30.4*Months+Days\)). I removed VertMeters because of its high correlation with ProfileScore; I know that ProfileScore was calculated using VertMeters. I removed StartlistQualScore because of its high correlation with RaceRanking; on PCS the two are directly linked. I believe other highly correlated variables, such as GCScore and ClimberScore, or WinningTimeMin and Distance, are still useful for our analysis. For example, in the rare case that a rider has a low GCScore but a high ClimberScore, that tells us the rider is very good at climbing but weak at other GC skills such as time trialing. Also, if WinningTimeMin and Distance do not track each other as predicted, something must have happened in that race to make it much slower; this could provide some interesting insights.
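A quick sanity check of the derived-column formulas is possible against bikingOG, the copy kept before columns were dropped (a sketch; it assumes those columns exist under the names shown in the earlier str() output):

```r
#WinningTimeMin should equal Hours*60 + Minutes exactly
stopifnot(all(bikingOG$WinningTimeMin ==
              bikingOG$WinningTimeHours * 60 + bikingOG$WinningTimeMinutes))

#DaysIntoYear uses an approximate 30.4-day month, so allow a small tolerance
stopifnot(all(abs(bikingOG$DaysIntoYear -
                  (30.4 * bikingOG$Months + bikingOG$Days)) < 1))
```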

Below is the correlation plot before variables were removed for reference.

#from EDA.rmd
cor_results <- bikingOG %>% 
  select(-Rnk) %>% 
  correlate()
## Non-numeric variables removed from input: `raceName`, `RiderName`, `RiderLastName`, `Team`, `ParcourTypeCategorical`, and `WonByCategorical`
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
cor_results %>% 
  stretch() %>% 
  ggplot(aes(x,y,fill=r)) +
  geom_tile() +
  geom_text(aes(label=as.character(fashion(r)))) +
  theme(axis.text.x=element_text(angle=-90))

We will proceed forward with splitting the data.

set.seed(110)

#split the dataset by 0.75
bikeSplit <- initial_split(bikeClass, prop=0.75, strata="Rnk")

#set into training and test
bikeTrain <- training(bikeSplit)
bikeTest <- testing(bikeSplit)

#create a ten fold cross validation
bikeFold <- vfold_cv(bikeTrain, v=10, strata="Rnk")

Building the recipe

Building the recipe for the bikeClass

#build the recipe
bikeRecipe <- recipe(Rnk~., data=bikeTrain) %>% 
  step_dummy(all_nominal_predictors()) %>% #dummy code all the categorical variables
  step_normalize(all_predictors()) %>% #set all numeric variables such that they have a mean of 0 and variance of 1
  step_upsample(Rnk, over_ratio=0.5) #upsample Rnk (our response variable) so that the top10 count equals half the notTop10 count

I am well aware that my response variable is very unbalanced. To make up for this, I am upsampling. However, I don't want to overdo how many observations get copied, so I only set the over_ratio to 0.5.

Model choice and fitting

I shall fit five models to my classification dataset. I first decided on a logistic regression as a good baseline model. The next model I decided to use was a regularized logistic regression, in order to return the best possible logistic regression using a combination of Lasso and Ridge regularization and tuning the penalty hyperparameter. I also decided to use a K-nearest neighbors (KNN) model, as it is distinctly different from the other models. Finally, I decided to use a random forest and a boosted tree model, as these are the most advanced models and I expect them to return the best results.

#Logistic regression
log_reg <- logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glm")

#Regularized regression
log_reg_reg <- logistic_reg(mixture=tune(),
                            penalty=tune()) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")

#K-nearest neighbors
KNN <- nearest_neighbor(neighbors=tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

#random forest
rand_for <- rand_forest(mtry=tune(),
                        trees=tune(),
                        min_n=tune()) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

#boosted trees
boosted_for <- boost_tree(mtry=tune(),
                          trees=tune(),
                          learn_rate=tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

I am going to do some tuning in order to optimize my hyperparameters. For the regularized logistic regression I chose 0 through 1 for both penalty and mixture, to get a wide range of regularization penalties and a wide range of mixtures between Lasso and Ridge. For KNN I chose one-step increments between 2 and 12; 1 is probably too specific, but 2 through 12 gives a range that should provide useful insights. For my random forest, I chose the number of predictors randomly selected at each split (mtry) to be between 8 and 20, since I have 26 predictors and want a good range of values. I chose trees to be between 100 and 800 because this gives lots of varying levels of forest size. For the minimum node size I chose between 10 and 18, which is a decent number of observations at each terminal node without being too many. For the boosted tree, I used the same logic for mtry and trees as for the random forest. For learn rate I chose between -10 and -1 because this corresponds to \(1*10^{-10}\) through \(0.1\), giving a wide variety of learning rates. I decided to drop the number of levels from 8 to 5 in order to decrease run time for the random forest and boosted tree. With 8 levels and three hyperparameters to tune, we would have \(8^{3}=512\) models to run on each fold; dropping to 5 leaves only \(5^{3}=125\) models, which is much more doable.

#regularized logistic regression grid
log_reg_reg_grid <- grid_regular(penalty(range=c(0,1),
                                 trans=identity_trans()),
                                 mixture(range=c(0,1)),
                                 levels=5)
#KNN grid
KNN_grid <- grid_regular(neighbors(range=c(2,12)),
                         levels=11)
#random forest grid
rand_for_grid <- grid_regular(mtry(range=c(8,20)),
                              trees(range=c(100,800)),
                              min_n(range=c(10,18)),
                              levels=5)

#boosted tree grid
boosted_for_grid <- grid_regular(mtry(range=c(8,20)),
                                 trees(range=c(100,800)),
                                 learn_rate(range=c(-10,-1)),
                                 levels=5)
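As a quick check (a sketch, assuming the grid objects defined just above), the grids should be exactly the sizes the paragraph above expects:

```r
#each tree-based grid crosses 5 levels of 3 hyperparameters: 5^3 = 125 rows
nrow(rand_for_grid)
nrow(boosted_for_grid)

#the KNN grid has one row per neighbor value from 2 to 12: 11 rows
nrow(KNN_grid)
```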

Now we set up the workflows

#workflow for the logistic regression model
log_reg_wkflw <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(bikeRecipe)

#workflow for the regularized regression
log_reg_reg_wkflw <- workflow() %>% 
  add_model(log_reg_reg) %>% 
  add_recipe(bikeRecipe)

#workflow for K-nearest neighbors
KNN_wkflw <- workflow() %>% 
  add_model(KNN) %>% 
  add_recipe(bikeRecipe)

#workflow for the random forest
rand_for_wkflw <- workflow() %>% 
  add_model(rand_for) %>% 
  add_recipe(bikeRecipe)

#workflow for the boosted tree
boosted_for_wkflw <- workflow() %>% 
  add_model(boosted_for) %>% 
  add_recipe(bikeRecipe)

Next, we fit the models. Note that this is going to take a while to run. Thus, I set eval to false so that I would not have to run this every time. I then saved the models at the end of this code so they could be loaded in later and analyzed without having to run this code every single time.

#model fit for the logistic regression model
log_reg_fit <- fit_resamples(
  object=log_reg_wkflw,
  resamples=bikeFold
)
#save the model
save(log_reg_fit, file=here("Project", "modelRuns", "class_log_reg_fit.rda"))

#model fit for regularized regression
log_reg_reg_fit <- tune_grid(
  object=log_reg_reg_wkflw,
  resamples=bikeFold,
  grid=log_reg_reg_grid,
  control=control_grid(verbose=TRUE)
)
save(log_reg_reg_fit, file=here("Project", "modelRuns", "class_log_reg_reg_fit.rda"))

#model fit for k-nearest-neighbors
KNN_fit <- tune_grid(
  object=KNN_wkflw,
  resamples=bikeFold,
  grid=KNN_grid,
  control=control_grid(verbose=TRUE)
)
save(KNN_fit, file=here("Project", "modelRuns", "class_KNN_fit.rda"))

#model fit for the random forest
rand_for_fit <- tune_grid(
  object=rand_for_wkflw,
  resamples=bikeFold,
  grid=rand_for_grid,
  control=control_grid(verbose=TRUE)
)
save(rand_for_fit, file=here("Project", "modelRuns", "class_rand_for_fit.rda"))

#model fit for boosted tree
boosted_for_fit <- tune_grid(
  object=boosted_for_wkflw,
  resamples=bikeFold,
  grid=boosted_for_grid,
  control=control_grid(verbose=TRUE)
)
save(boosted_for_fit, file=here("Project", "modelRuns", "class_boosted_for_fit.rda"))

Regression Models

Note: much of the code is the same as in the classification models above. Commentary has been removed where it would be redundant. The major difference is that the response variable is no longer categorical but numeric: it is simply the place the rider finished. Commentary is still included where the code is novel.

Read in dataset

bikeReg <- read_excel(here("Project", "rawData", "bikingDF.xlsx"))

Split data

#make the classification variables into factors
bikeReg$ParcourTypeCategorical <- factor(bikeReg$ParcourTypeCategorical)
bikeReg$WonByCategorical <- factor(bikeReg$WonByCategorical)

#remove unnecessary variables
bikeReg <- bikeReg %>% 
  select(-c(raceName, RiderName, RiderLastName, Team, WinningTimeHours, WinningTimeMinutes, Months, Days, VertMeters, StartlistQualScore))

set.seed(110)

#split the dataset by 0.75
bikeSplit <- initial_split(bikeReg, prop=0.75, strata="Rnk")

#set into training and test
bikeTrain <- training(bikeSplit)
bikeTest <- testing(bikeSplit)

#create a ten fold cross validation
bikeFold <- vfold_cv(bikeTrain, v=10, strata="Rnk")

Building the recipe

#build the recipe
bikeRecipe <- recipe(Rnk~., data=bikeTrain) %>% 
  step_dummy(all_nominal_predictors()) %>% #dummy code all the categorical variables
  step_normalize(all_predictors()) #set all numeric variables such that they have a mean of 0 and variance of 1

Model choice and fitting

All the models here are the same as the classification models, with a few differences. First, instead of a logistic regression we use a linear regression for the first model. Second, the regularized regression is based on a linear regression instead of a logistic regression. And for all the models, the mode is regression instead of classification.

#Linear regression
lin_reg <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("glm")

#Regularized regression
lin_reg_reg <- linear_reg(mixture=tune(),
                            penalty=tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

#K-nearest neighbors
KNN <- nearest_neighbor(neighbors=tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn")

#random forest
rand_for <- rand_forest(mtry=tune(),
                        trees=tune(),
                        min_n=tune()) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

#boosted trees
boosted_for <- boost_tree(mtry=tune(),
                          trees=tune(),
                          learn_rate=tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

The same hyperparameter ranges were used as in the classification models, based on the same logic, except that mtry here ranges from 1 to 16.

#regularized linear regression grid
lin_reg_reg_grid <- grid_regular(penalty(range=c(0,1),
                                 trans=identity_trans()),
                                 mixture(range=c(0,1)),
                                 levels=5)
#KNN grid
KNN_grid <- grid_regular(neighbors(range=c(2,12)),
                         levels=11)
#random forest grid
rand_for_grid <- grid_regular(mtry(range=c(1,16)),
                              trees(range=c(100,800)),
                              min_n(range=c(10,18)),
                              levels=5)

#boosted tree grid
boosted_for_grid <- grid_regular(mtry(range=c(1,16)),
                                 trees(range=c(100,800)),
                                 learn_rate(range=c(-10,-1)),
                                 levels=5)

Now we set up the workflows

#workflow for the linear regression model
lin_reg_wkflw <- workflow() %>% 
  add_model(lin_reg) %>% 
  add_recipe(bikeRecipe)

#workflow for the regularized regression
lin_reg_reg_wkflw <- workflow() %>% 
  add_model(lin_reg_reg) %>% 
  add_recipe(bikeRecipe)

#workflow for K-nearest neighbors
KNN_wkflw <- workflow() %>% 
  add_model(KNN) %>% 
  add_recipe(bikeRecipe)

#workflow for the random forest
rand_for_wkflw <- workflow() %>% 
  add_model(rand_for) %>% 
  add_recipe(bikeRecipe)

#workflow for the boosted tree
boosted_for_wkflw <- workflow() %>% 
  add_model(boosted_for) %>% 
  add_recipe(bikeRecipe)

Next, we fit the models. Note that this is going to take a while to run. Thus, I set eval to false so that I would not have to run this every time. I then saved the models at the end of this code so they could be loaded in later and analyzed without having to run this code every single time.

#model fit for the linear regression model
lin_reg_fit <- fit_resamples(
  object=lin_reg_wkflw,
  resamples=bikeFold
)
#save the model
save(lin_reg_fit, file=here("Project", "modelRuns", "reg_lin_reg_fit.rda"))

#model fit for regularized regression
lin_reg_reg_fit <- tune_grid(
  object=lin_reg_reg_wkflw,
  resamples=bikeFold,
  grid=lin_reg_reg_grid,
  control=control_grid(verbose=TRUE)
)
save(lin_reg_reg_fit, file=here("Project", "modelRuns", "reg_lin_reg_reg_fit.rda"))

#model fit for k-nearest-neighbors
KNN_fit <- tune_grid(
  object=KNN_wkflw,
  resamples=bikeFold,
  grid=KNN_grid,
  control=control_grid(verbose=TRUE)
)
save(KNN_fit, file=here("Project", "modelRuns", "reg_KNN_fit.rda"))

#model fit for the random forest
rand_for_fit <- tune_grid(
  object=rand_for_wkflw,
  resamples=bikeFold,
  grid=rand_for_grid,
  control=control_grid(verbose=TRUE)
)
save(rand_for_fit, file=here("Project", "modelRuns", "reg_rand_for_fit.rda"))

#model fit for boosted tree
boosted_for_fit <- tune_grid(
  object=boosted_for_wkflw,
  resamples=bikeFold,
  grid=boosted_for_grid,
  control=control_grid(verbose=TRUE)
)
save(boosted_for_fit, file=here("Project", "modelRuns", "reg_boosted_for_fit.rda"))

Results

Read in the models

#read in the model runs from the classification code, rename to names that make sense
load(here("Project", "modelRuns", "class_boosted_for_fit.rda"))
class_boosted_for <- boosted_for_fit

load(here("Project", "modelRuns", "class_KNN_fit.rda"))
class_KNN <- KNN_fit

load(here("Project", "modelRuns", "class_log_reg_fit.rda"))
class_log_reg <- log_reg_fit

load(here("Project", "modelRuns", "class_log_reg_reg_fit.rda"))
class_log_reg_reg <- log_reg_reg_fit

load(here("Project", "modelRuns", "class_rand_for_fit.rda"))
class_rand_for <- rand_for_fit

#read in the model runs from the regression code, rename to names that make sense
load(here("Project", "modelRuns", "reg_boosted_for_fit.rda"))
reg_boosted_for <- boosted_for_fit

load(here("Project", "modelRuns", "reg_KNN_fit.rda"))
reg_KNN <- KNN_fit

load(here("Project", "modelRuns", "reg_lin_reg_fit.rda"))
reg_lin_reg <- lin_reg_fit

load(here("Project", "modelRuns", "reg_lin_reg_reg_fit.rda"))
reg_lin_reg_reg <- lin_reg_reg_fit

load(here("Project", "modelRuns", "reg_rand_for_fit.rda"))
reg_rand_for <- rand_for_fit

Analyzing how the classification models did

First, let’s see how the different models that we tuned performed across their tuning parameters.

#use autoplot to see how the tuned models performed
autoplot(class_KNN)

autoplot(class_log_reg_reg)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

autoplot(class_rand_for)

autoplot(class_boosted_for)

Note that there is no tuning plot for the logistic regression, as we did not tune it. Starting with K-nearest neighbors, more neighbors leads to a better area under the ROC curve (AUC). Accuracy does dip between five and six neighbors and again between 10 and 11, but not by much, so I am not too worried. AUC appears to level out around 11 or 12 neighbors, so that range is likely the best choice.

The regularized logistic regression provides some interesting insights. A low penalty appears to be best, and with a mixture of zero (a pure ridge penalty) the amount of penalty barely matters and AUC is just about the highest.

The random forest plot has a lot going on. The most notable trend is that AUC falls as the number of randomly selected predictors increases, so a low number of randomly selected predictors appears best. A higher number of trees is generally better, but as long as there are more than 100 trees, AUC does not vary much. Minimum node size does not appear to have much effect on AUC.

The boosted tree plots look quite different from the random forest, which is interesting. There is not much variation across panels; they are mostly flat lines, so the number of randomly selected predictors does not appear to have a large influence on model performance. The learning rate, however, does: the best learning rate is the highest tried, 0.1. Apart from the lowest learning rate, the number of trees does not matter much either. The only hyperparameter that clearly matters is the learning rate.

Let’s now see, for each model setup, which tuned candidate performed the best.

#show the best 10 performing models for each model
collect_metrics(class_log_reg) #no top 10 because no tuning was conducted
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.830    10 0.00350 Preprocessor1_Model1
## 2 roc_auc  binary     0.818    10 0.00664 Preprocessor1_Model1
show_best(class_KNN, n=10, metric='roc_auc')
## # A tibble: 10 × 7
##    neighbors .metric .estimator  mean     n std_err .config              
##        <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1        12 roc_auc binary     0.750    10 0.00626 Preprocessor1_Model11
##  2        11 roc_auc binary     0.747    10 0.00671 Preprocessor1_Model10
##  3        10 roc_auc binary     0.744    10 0.00659 Preprocessor1_Model09
##  4         9 roc_auc binary     0.738    10 0.00666 Preprocessor1_Model08
##  5         8 roc_auc binary     0.733    10 0.00745 Preprocessor1_Model07
##  6         7 roc_auc binary     0.726    10 0.00717 Preprocessor1_Model06
##  7         6 roc_auc binary     0.715    10 0.00622 Preprocessor1_Model05
##  8         5 roc_auc binary     0.706    10 0.00647 Preprocessor1_Model04
##  9         4 roc_auc binary     0.693    10 0.00768 Preprocessor1_Model03
## 10         3 roc_auc binary     0.676    10 0.00728 Preprocessor1_Model02
show_best(class_log_reg_reg, n=10, metric='roc_auc')
## # A tibble: 10 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config              
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1    0       1    roc_auc binary     0.818    10 0.00665 Preprocessor1_Model21
##  2    0       0.5  roc_auc binary     0.818    10 0.00668 Preprocessor1_Model11
##  3    0       0.75 roc_auc binary     0.818    10 0.00668 Preprocessor1_Model16
##  4    0       0.25 roc_auc binary     0.818    10 0.00669 Preprocessor1_Model06
##  5    0       0    roc_auc binary     0.815    10 0.00752 Preprocessor1_Model01
##  6    0.5     0.25 roc_auc binary     0.809    10 0.00708 Preprocessor1_Model08
##  7    0.75    0.25 roc_auc binary     0.809    10 0.00708 Preprocessor1_Model09
##  8    0.25    0.5  roc_auc binary     0.809    10 0.00708 Preprocessor1_Model12
##  9    0.25    0.75 roc_auc binary     0.809    10 0.00708 Preprocessor1_Model17
## 10    0.25    0.25 roc_auc binary     0.802    10 0.00840 Preprocessor1_Model07
show_best(class_rand_for, n=10, metric='roc_auc')
## # A tibble: 10 × 9
##     mtry trees min_n .metric .estimator  mean     n std_err .config             
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
##  1     8   625    18 roc_auc binary     0.860    10 0.00573 Preprocessor1_Model…
##  2     8   275    14 roc_auc binary     0.859    10 0.00523 Preprocessor1_Model…
##  3     8   450    14 roc_auc binary     0.859    10 0.00539 Preprocessor1_Model…
##  4     8   800    12 roc_auc binary     0.859    10 0.00519 Preprocessor1_Model…
##  5     8   275    18 roc_auc binary     0.859    10 0.00560 Preprocessor1_Model…
##  6     8   800    16 roc_auc binary     0.859    10 0.00560 Preprocessor1_Model…
##  7     8   450    18 roc_auc binary     0.859    10 0.00552 Preprocessor1_Model…
##  8     8   275    16 roc_auc binary     0.859    10 0.00560 Preprocessor1_Model…
##  9     8   625    10 roc_auc binary     0.859    10 0.00511 Preprocessor1_Model…
## 10     8   625    16 roc_auc binary     0.859    10 0.00541 Preprocessor1_Model…
show_best(class_boosted_for, n=10, metric='roc_auc')
## # A tibble: 10 × 9
##     mtry trees learn_rate .metric .estimator  mean     n std_err .config        
##    <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
##  1    17   450        0.1 roc_auc binary     0.876    10 0.00595 Preprocessor1_…
##  2     8   450        0.1 roc_auc binary     0.876    10 0.00587 Preprocessor1_…
##  3     8   275        0.1 roc_auc binary     0.876    10 0.00551 Preprocessor1_…
##  4    11   275        0.1 roc_auc binary     0.875    10 0.00579 Preprocessor1_…
##  5    17   275        0.1 roc_auc binary     0.875    10 0.00510 Preprocessor1_…
##  6    14   450        0.1 roc_auc binary     0.875    10 0.00573 Preprocessor1_…
##  7    14   275        0.1 roc_auc binary     0.875    10 0.00544 Preprocessor1_…
##  8    20   275        0.1 roc_auc binary     0.875    10 0.00544 Preprocessor1_…
##  9     8   625        0.1 roc_auc binary     0.875    10 0.00594 Preprocessor1_…
## 10    14   625        0.1 roc_auc binary     0.874    10 0.00597 Preprocessor1_…

Comparing the best candidate from each model type, K-nearest neighbors performed the worst. Interestingly, the regularized logistic regression performed no better than the non-regularized version. The two tree-based models performed the best, with the boosted tree slightly outperforming the random forest. The best model overall is therefore the boosted tree with 17 randomly selected predictors, 450 trees, and a learning rate of 0.1. Let’s test this model on the testing set!

#select the best boosted forest
best_class_mod_class <- select_best(class_boosted_for, metric='roc_auc')

#fit to training set
final_boosted_class <- finalize_workflow(boosted_for_wkflw_class, best_class_mod_class) %>% 
  fit(bikeTrainClass)

#evaluate on the testing set
best_class_testing <- augment(final_boosted_class, bikeTestClass) %>% 
  select(Rnk, starts_with(".pred"))

#evaluate the model
roc_auc(best_class_testing, truth=Rnk, .pred_notTop10)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.881

The model performed even better on the testing set than it did across the training folds! Let’s see the variable importance plot.

#variable importance plot
final_boosted_class %>%
  extract_fit_parsnip() %>% 
  vip(num_features=26)

This variable importance plot is super interesting. It makes sense that PCSRanking is by far the best predictor of race results: it is the overall ranking of how good a rider is, so it should naturally predict rider performance well. It also makes sense that some of the next-best predictors are the component scores describing a rider’s strengths. It is encouraging for the model setup that the important variables are a mix of variables from the rider profile dataset and the race profile dataset; it shows that both the race setup and the individual rider play a major role in how the race is won (or how riders get in the top 10). I am also happy to see that, while some variables are clearly more important than others, every variable had some role to play. Let’s look at the ROC curve.

#plot the ROC curve.
roc_curve(best_class_testing, truth=Rnk, .pred_notTop10) %>% 
  autoplot()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## ℹ The deprecated feature was likely used in the yardstick package.
##   Please report the issue at <https://github.com/tidymodels/yardstick/issues>.

The ROC curve looks about as we would expect. It is by no means perfect but it does a lot better than just random guessing. Let’s create a confusion matrix next.

#confusion matrix
conf_mat(best_class_testing, truth=Rnk, .pred_class) %>% 
  autoplot(type="heatmap")

The confusion matrix reveals a lot about this model. The model got very good at predicting when a rider would not finish in the top 10 and was very hesitant to predict that a rider would finish in the top 10. Among riders who actually did finish in the top 10, the model’s error rate was below 50 percent, which is better than chance but still not great. This reveals that the model is not doing as well as the ROC curve suggests.
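The class-specific error rates discussed above can be quantified directly with yardstick. This is a hedged sketch assuming `best_class_testing` exists as built above, with `Rnk` as the truth factor and `.pred_class` as the hard prediction:

```r
#quantify the error rates visible in the confusion matrix
#assumes best_class_testing with factor truth Rnk and predictions .pred_class
library(yardstick)

class_metrics <- metric_set(sensitivity, specificity)
class_metrics(best_class_testing, truth = Rnk, estimate = .pred_class)

#with the default event_level = "first", sensitivity measures how often the
#first factor level (here, presumably notTop10) is correctly flagged;
#specificity covers the other class. A low specificity would confirm the
#model's hesitance to predict a top-10 finish.
```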

Now to see how the regression models did.

Analyzing how the regression models did

First, let’s see how the different models that we tuned performed across their tuning parameters.

#use autoplot to see how the tuned models performed
autoplot(reg_KNN)

autoplot(reg_lin_reg_reg)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

autoplot(reg_rand_for)

autoplot(reg_boosted_for)

My initial reaction is that the R-squared values are low throughout; none of the models did a very good job of predicting race outcome. For K-nearest neighbors, the trend matches the classification models: the more neighbors, the better the model performed.

For the regularized linear regression we see another familiar trend. A mixture of zero (a pure ridge penalty) seems to perform best across penalties, and as the mixture increases, the best models are those with a low penalty.

The random forest looks considerably different. The number of trees and the minimum node size do not appear to affect R-squared. The number of randomly selected predictors also does not matter much, as long as more than two are sampled; values above two work almost equally well.

The boosted tree shows some interesting results as well. The best R-squared values occur at a learning rate of 0.1, and at that rate the models with more trees appear to perform better. We also see the same trend as in the random forest for the number of randomly selected predictors: as long as it is greater than two, all settings performed roughly equally well.

Let’s now see, for each model setup, which tuned candidate performed the best.

#show the best 10 performing models for each model
collect_metrics(reg_lin_reg) #no top 10 because no tuning was conducted
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   38.8      10 0.0910  Preprocessor1_Model1
## 2 rsq     standard    0.238    10 0.00397 Preprocessor1_Model1
show_best(reg_KNN, n=10, metric='rsq')
## # A tibble: 10 × 7
##    neighbors .metric .estimator  mean     n std_err .config              
##        <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1        12 rsq     standard   0.297    10 0.00593 Preprocessor1_Model11
##  2        11 rsq     standard   0.294    10 0.00590 Preprocessor1_Model10
##  3        10 rsq     standard   0.290    10 0.00587 Preprocessor1_Model09
##  4         9 rsq     standard   0.285    10 0.00581 Preprocessor1_Model08
##  5         8 rsq     standard   0.279    10 0.00572 Preprocessor1_Model07
##  6         7 rsq     standard   0.272    10 0.00561 Preprocessor1_Model06
##  7         6 rsq     standard   0.264    10 0.00556 Preprocessor1_Model05
##  8         5 rsq     standard   0.253    10 0.00559 Preprocessor1_Model04
##  9         4 rsq     standard   0.239    10 0.00561 Preprocessor1_Model03
## 10         3 rsq     standard   0.219    10 0.00538 Preprocessor1_Model02
show_best(reg_lin_reg_reg, n=10, metric='rsq')
## # A tibble: 10 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config              
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1    0       1    rsq     standard   0.238    10 0.00397 Preprocessor1_Model21
##  2    0.25    0.25 rsq     standard   0.238    10 0.00392 Preprocessor1_Model07
##  3    0       0.75 rsq     standard   0.238    10 0.00396 Preprocessor1_Model16
##  4    0       0.5  rsq     standard   0.238    10 0.00397 Preprocessor1_Model11
##  5    0       0.25 rsq     standard   0.238    10 0.00396 Preprocessor1_Model06
##  6    0.25    0.5  rsq     standard   0.238    10 0.00393 Preprocessor1_Model12
##  7    0.5     0.25 rsq     standard   0.238    10 0.00393 Preprocessor1_Model08
##  8    0       0    rsq     standard   0.238    10 0.00389 Preprocessor1_Model01
##  9    0.25    0    rsq     standard   0.238    10 0.00389 Preprocessor1_Model02
## 10    0.5     0    rsq     standard   0.238    10 0.00389 Preprocessor1_Model03
show_best(reg_rand_for, n=10, metric='rsq')
## # A tibble: 10 × 9
##     mtry trees min_n .metric .estimator  mean     n std_err .config             
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
##  1     8   625    10 rsq     standard   0.470    10 0.00593 Preprocessor1_Model…
##  2     8   450    10 rsq     standard   0.470    10 0.00588 Preprocessor1_Model…
##  3     8   800    10 rsq     standard   0.470    10 0.00619 Preprocessor1_Model…
##  4    12   800    10 rsq     standard   0.470    10 0.00609 Preprocessor1_Model…
##  5    12   625    10 rsq     standard   0.469    10 0.00594 Preprocessor1_Model…
##  6    12   450    10 rsq     standard   0.469    10 0.00606 Preprocessor1_Model…
##  7    16   625    10 rsq     standard   0.469    10 0.00590 Preprocessor1_Model…
##  8     8   275    10 rsq     standard   0.469    10 0.00596 Preprocessor1_Model…
##  9    12   275    10 rsq     standard   0.469    10 0.00581 Preprocessor1_Model…
## 10    16   800    10 rsq     standard   0.469    10 0.00604 Preprocessor1_Model…
show_best(reg_boosted_for, n=10, metric='rsq')
## # A tibble: 10 × 9
##     mtry trees learn_rate .metric .estimator  mean     n std_err .config        
##    <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
##  1     8   625        0.1 rsq     standard   0.486    10 0.00651 Preprocessor1_…
##  2     8   800        0.1 rsq     standard   0.485    10 0.00663 Preprocessor1_…
##  3     8   450        0.1 rsq     standard   0.485    10 0.00621 Preprocessor1_…
##  4    16   450        0.1 rsq     standard   0.485    10 0.00625 Preprocessor1_…
##  5    16   625        0.1 rsq     standard   0.485    10 0.00639 Preprocessor1_…
##  6    12   625        0.1 rsq     standard   0.484    10 0.00673 Preprocessor1_…
##  7    12   800        0.1 rsq     standard   0.484    10 0.00666 Preprocessor1_…
##  8    12   450        0.1 rsq     standard   0.484    10 0.00650 Preprocessor1_…
##  9    16   800        0.1 rsq     standard   0.483    10 0.00603 Preprocessor1_…
## 10     4   800        0.1 rsq     standard   0.482    10 0.00623 Preprocessor1_…

Unlike in the classification setting, the K-nearest neighbors model outperformed both the regularized and unregularized linear regressions, and the best regularized linear regression performed essentially the same as the unregularized one. As with the classification models, the tree-based models were the best, and it is notable just how much they outperformed the other models. Once again the boosted tree slightly outperformed the random forest, so we will use it for the rest of the analysis.

Let’s test this model on the testing set!

#select the best boosted forest
best_class_mod_reg <- select_best(reg_boosted_for, metric='rsq')

#fit to training set
final_boosted_reg <- finalize_workflow(boosted_for_wkflw_reg, best_class_mod_reg) %>% 
  fit(bikeTrainReg)

#evaluate on the testing set
best_reg_testing <- augment(final_boosted_reg, bikeTestReg) %>% 
  select(Rnk, starts_with(".pred"))

#evaluate the model
rsq(best_reg_testing, truth=Rnk, estimate=.pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.480

The model performed about the same on the testing set as it did across the training folds. While it did not perform spectacularly, we were asking the model to do a lot, so I am not super disappointed in this result.

#variable importance plot
final_boosted_reg %>%
  extract_fit_parsnip() %>% 
  vip(num_features=26)

I am interested to see that this variable importance plot looks somewhat different from the one for the classification model. The same variables top the list, but in addition to the drop from PCSRanking to ClimberScore there is another drop from GCScore to NumFinished. This means the model relied on GCScore and ClimberScore more, rather than just PCSRanking; in fact, PCSRanking’s importance dropped from around 0.27 to 0.17.

Let’s create a predicted vs. actual value plot.

best_reg_testing %>% 
  ggplot(aes(x=.pred, y=Rnk)) +
  geom_point(alpha=0.4, size=0.5) +
  geom_abline(lty=2) +
  labs(title="Predicted and Actual Values of Rank", x="Prediction", y="Actual Value")

Overall, the model performed fairly well, though it clearly does not understand the bounds of the response: it is impossible to finish in a negative place, which the model predicted a handful of times. All in all, the model did a pretty good job of capturing the general range of where riders would finish. It is worth noting that there appear to be more points below the line at low values and more above the line at high values. I think this is because the model was not good at predicting riders who surprised and did well, nor at predicting when a rider disappoints by doing worse than expected; the model has no way of knowing why a rider underperformed. For example, the favorite could have crashed, which the model cannot know.
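One simple post-processing fix for the impossible negative predictions noted above is to clamp predictions to a feasible range of finishing positions. A hedged sketch, assuming `best_reg_testing` as built above; the upper bound of 200 (a typical startlist size) is an illustrative assumption:

```r
#clamp predictions to a plausible range of finishing positions
#assumes best_reg_testing has numeric columns Rnk and .pred;
#the upper bound of 200 is an assumed typical startlist size
library(dplyr)

best_reg_clamped <- best_reg_testing %>% 
  mutate(.pred = pmin(pmax(.pred, 1), 200))

#re-check fit after clamping
rsq(best_reg_clamped, truth = Rnk, estimate = .pred)
```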

Predicting MSR

Let’s see how the boosted tree regression model does at predicting Milano-Sanremo, one of the first major races of the 2023 calendar, held on Saturday, March 18th. I only included the rider profiles of the favorites for the race.

#read in the dataset
MSR2023 <- read_excel(here("Project", "rawData", "MSR2023.xlsx"))
MSR2023conv <- MSR2023

#remove unnecessary variables and factorize categorical variables
MSR2023$ParcourTypeCategorical <- factor(MSR2023$ParcourTypeCategorical)
MSR2023$WonByCategorical <- factor(MSR2023$WonByCategorical)
MSR2023 <- MSR2023 %>% 
  select(-c(raceName, RiderLastName, Team, WinningTimeHours, WinningTimeMinutes, Months, Days, VertMeters, StartlistQualScore))

#evaluate on the boosted tree model
MSRPred <- augment(final_boosted_reg, MSR2023) %>% 
  select(Rnk, starts_with(".pred"), RiderName)

#Let's see how the model did
head(MSRPred)
## # A tibble: 6 × 3
##     Rnk .pred RiderName                                    
##   <dbl> <dbl> <chr>                                        
## 1     4  21.1 POGAČAR TadejUAE Team Emirates               
## 2     1  26.5 VAN DER POEL MathieuAlpecin-Fenix            
## 3     3  27.4 VAN AERT WoutJumbo-Visma                     
## 4     2  76.8 GANNA FilippoINEOS Grenadiers                
## 5     8  34.7 MOHORIČ MatejBahrain - Victorious            
## 6    11  28.9 ALAPHILIPPE JulianQuick-Step Alpha Vinyl Team

I don’t think the model did that badly! Predicting Pogačar as the favorite is probably a good call; most betting markets placed him as the favorite. Filippo Ganna getting such a high (i.e., poor) predicted finishing position is not unreasonable: while people definitely had him down as an outsider favorite, he had not previously performed at this level in this type of race. One factor that made him a heavier favorite is that his teammate, Tom Pidcock, was ruled out of the race last minute, elevating Ganna to sole leader of the team, which meant he had the team’s full resources behind him to push for the win. That factor was not included in our model. Van der Poel being second favorite on this list and then winning is a very reasonable prediction by the model: he had not had any standout performances yet this road season, so it was somewhat surprising to see him in such good form and winning the race.

Van der Poel celebrating his victory

Maybe buy a bigger couch next time? (From left to right: Ganna, Van der Poel, Van Aert)

Conclusion

Overall, I would say that the models performed better than expected. A lot goes into a bike race, and many important factors could not be captured by the models: whether the rider crashed, how the rider woke up feeling, whether the rider was in top shape, and many more. The boosted tree model performed the best across both problem types, which should not be too surprising, as it was the most advanced model we used.

I believe the regression model worked better than the classification model. Because of how the classification problem was split, that model became really good at predicting when someone would not finish in the top ten instead of figuring out when someone would. With a startlist of normally close to 200 riders per race and just a handful of favorites, it is not very profound to predict that a given rider will not finish in the top ten; we want to know who will be in the top ten! If I were to run the classification problem again, I would at the very least change the over_ratio to a value closer to 1, and I would consider creating multiple classes so the model focuses on more than just the riders in the top 10. While the regression model was not perfect, it also provides more power: if you are trying to predict which of two riders will perform better, numeric predictions can be compared directly, whereas the classification model may simply label both favorites as “top10”.

For both models, the most important variable by far was PCSRanking, probably because PCSRanking is a function of how well a rider does throughout the year. This reveals a potential weakness in our models: because PCSRanking is a function of results, the results we use as the response variable feed into PCSRanking. (Note that PCSRanking reflects how well the rider did throughout the year, not solely in one race.) If I were to run the models again, I would consider removing PCSRanking and the other variables that are a function of the race results used as the response, or imputing PCSRanking as its value before the race result was known. That would require somewhat complicated coding but would provide better insights. The fact that the regression model’s variable importance plot spreads importance across more variables than the classification model’s suggests the regression model used more than a rider’s ranking to make its decisions, which is a good thing.

I would also be interested in how model performance changes across years. If I used this model to predict riders racing this year, would it do worse than last year? I think it would, because the model was built on racing in 2022, and racing strategy will probably change between 2022 and 2023, if ever so slightly; compounded over ten years, this effect would be more pronounced. If I were to build this model again, I would make sure the variables I use are not dependent on the response variable, and I would be interested to try adding variables such as “number of crashes” or “number of mechanicals”, which could provide insight into the race. I also think it would be interesting to introduce a momentum variable: a value summarizing how a rider has done in their past 10 or so races, so that riders on a winning streak could be predicted by the model to keep winning. Overall, this was a big problem for a machine learning algorithm to solve. I think it did the best it could with the information given, but a lot of racing information is very difficult to quantify and thus would be difficult for a machine learning algorithm to use.
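The momentum idea could be sketched roughly as a trailing average of each rider’s recent finishing positions. This is a hypothetical sketch: the data frame name `bikeData` and the columns `RiderName`, `raceDate`, and `Rnk` are assumptions about the merged dataset, and the slider package is one of several ways to compute a rolling mean:

```r
#hedged sketch: a "momentum" feature as the mean finishing position over
#each rider's previous 10 races (excluding the current race via lag())
#bikeData, RiderName, raceDate, and Rnk are assumed column/object names
library(dplyr)
library(slider)

bikeData_momentum <- bikeData %>% 
  arrange(RiderName, raceDate) %>% 
  group_by(RiderName) %>% 
  mutate(momentum = slide_dbl(lag(Rnk), mean, na.rm = TRUE, .before = 9)) %>% 
  ungroup()
```

Using `lag(Rnk)` keeps the current race’s result out of its own momentum value, which avoids the same response-leakage problem discussed for PCSRanking.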

Sources

Data for this project was obtained from ProCyclingStats. I downloaded the data and compiled it into the bikingClass spreadsheet and the raceProfile spreadsheet, then downloaded all the race results and merged these three spreadsheets together to form the final data frame.